In [ ]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb
sb.set_style('whitegrid')
import requests
import json
import re
from collections import Counter
from bs4 import BeautifulSoup
import string
import nltk
import networkx as nx
In [ ]:
string.ascii_letters
There's also punctuation and digits.
In [ ]:
string.punctuation
In [ ]:
string.digits
A string is basically a sequence of characters, each mapped to an underlying number. We can use other functions and methods to analyze a string the way we would other iterables (like a list).
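For instance, a string can be indexed, sliced, and looped over just like a list:
In [ ]:
# Index, slice, and loop over a string the way you would a list
name = 'Brian'
print(name[0], name[1:3])
[character for character in name]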
The len of a string returns the number of characters in it.
In [ ]:
len('Brian')
Two (or more) strings can be combined by adding them together.
In [ ]:
'Brian' + ' ' + 'Keegan'
Every character is mapped to an underlying integer code.
In [ ]:
ord('B')
In [ ]:
ord('b')
We can also use chr to do the reverse mapping: finding what character exists at a particular numeric value.
In [ ]:
chr(66)
When you're doing comparisons, you're basically comparing these numbers to each other.
In [ ]:
'b' == 'B'
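The same numeric codes drive ordering comparisons, which is why every uppercase letter sorts before every lowercase one:
In [ ]:
# 'Z' maps to 90 and 'a' maps to 97, so capitalized words sort first
print('Z' < 'a', ord('Z'), ord('a'))
sorted(['banana', 'Apple', 'cherry'])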
Here are the first 128 characters (the ASCII range). Some of the early entries aren't printable characters at all, but control characters or whitespace.
In [ ]:
[(i,chr(i)) for i in range(128)]
You'll notice that this ASCII character mapping doesn't include characters that have accents.
In [ ]:
s = 'Beyoncé'
This last character, é, also exists at a specific location.
In [ ]:
ord('é')
In [ ]:
chr(233)
However, the mapping Python uses internally is not necessarily how text is stored and exchanged everywhere else in the world. If we use the popular UTF-8 standard to encode this string into a generic byte-level representation, we get something interesting:
In [ ]:
b = s.encode('utf8')
b
Compared to the original s string, the encoded b string has somehow picked up an extra character.
In [ ]:
print(s,len(s))
print(b,len(b))
If we try to discover where these two new byte values live and then map them back individually, we run into problems: we get two different characters rather than the é we started with.
In [ ]:
ord(b'\xc3'), ord(b'\xa9')
In [ ]:
chr(195), chr(169)
We can convert from this byte-level representation back into Unicode with the .decode method.
In [ ]:
b.decode('utf8')
In [ ]:
ord(b'\xc3\xa9'.decode('utf8')), chr(233)
Using a different decoding standard like CP1252 returns something much more grotesque without throwing any errors.
In [ ]:
b.decode('cp1252')
There are many, many kinds of character encodings for representing non-ASCII text. Other resources include Ned Batchelder's talk on why Unicode is what it is, a tutorial by Esther Nam and Travis Fischer, and the Unicode HOWTO in the Python docs.
In [ ]:
for codec in ['latin1','utf8','cp437','gb2312','utf16']:
    print(codec.rjust(10), s.encode(codec), sep=' = ')
You will almost certainly encounter string encoding problems whenever you work with text data. Let's look at how quickly things can go wrong trying to decode a string when we don't know the standard.
Some standards map the \xe9 byte to the é character we intended, while other standards have nothing at that byte location, and still others map that byte to a different character.
In [ ]:
montreal_s = b'Montr\xe9al'
for codec in ['cp437','cp1252','latin1','gb2312','iso8859_7','koi8_r','utf8','utf16']:
    print(codec.rjust(10), montreal_s.decode(codec, errors='replace'), sep=' = ')
How do you discover the proper encoding given an observed byte sequence? In general, you can't. But you can make some informed guesses by using a library like chardet to look for clues such as relative byte frequencies and the presence of byte-order marks.
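For example, here is a minimal sketch with the third-party chardet package (not part of the standard library, so it needs to be installed separately):
In [ ]:
# chardet guesses the encoding from the raw bytes; on an input this short,
# both the guess and its reported confidence can be unreliable
import chardet
chardet.detect(montreal_s)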
What is your system's default? More accurately, what are the defaults? The situation on PCs is generally a hot mess, with Microsoft standards like CP1252 competing against international standards like UTF-8, while Macs generally try to keep everything in UTF-8.
In [ ]:
import sys, locale
expressions = """
locale.getpreferredencoding()
my_file.encoding
sys.stdout.encoding
sys.stdin.encoding
sys.stderr.encoding
sys.getdefaultencoding()
sys.getfilesystemencoding()
"""
my_file = open('dummy', 'w')
for expression in expressions.split():
    value = eval(expression)
    print(expression.rjust(30), '=', repr(value))
Whenever you encounter character encoding problems and cannot discover the original encoding (utf8, latin1, and cp1252 are always good ones to try first), you can ignore or replace the offending characters.
In [ ]:
# \xe9 is not valid UTF-8, so decoding raises a UnicodeDecodeError
montreal_s.decode('utf8')
In [ ]:
for error_handling in ['ignore','replace']:
    print(error_handling, montreal_s.decode('utf8', errors=error_handling), sep='\t')
Unfortunately, only tears, fist-shaking, and hair-pulling will give you the necessary experience to handle the inevitable character encoding issues you'll hit when working with textual data.
Load the data from disk into memory. See Appendix 1 at the end of the notebook for more details.
In [ ]:
with open('potus_wiki_bios.json','r') as f:
    bios = json.load(f)
Confirm there are 44 presidents (shaking fist at Grover Cleveland, the 22nd and 24th POTUS) in the dictionary.
In [ ]:
print("There are {0} biographies of presidents.".format(len(bios)))
What's an example of a single biography? We access the dictionary by passing the key (President's name), which returns the value (the text of the biography).
In [ ]:
example = bios['Grover Cleveland']
print(example)
We are going to discuss how to process large text documents using the Natural Language Toolkit (NLTK) library.
We first have to download some data corpora and models to use NLTK. Running this block of code should pop up a new window with four blue tabs: Collections, Corpora, Models, All Packages. Under Collections, select the entry with "book" in the Identifier column and click Download. Once the status "Finished downloading collection 'book'." appears in the grey bar at the bottom, you can close the pop-up.
You should only need to do this step once on each computer where you use NLTK.
In [ ]:
# Download a specific lexicon for the sentiment analysis in the next lecture
nltk.download('vader_lexicon')
# Opens the interface to download all the other corpora
nltk.download()
An important part of processing natural language data is normalizing this data by removing variations in the text that the computer naively thinks are different entities but humans recognize as being the same. There are several steps to this including case adjustment (House to house), tokenizing (finding individual words), and stemming/lemmatization ("tried" to "try").
This figure is a nice summary of the process of pre-processing your text data. The HTML-to-ASCII step has already been done with the get_page_content function in the Appendix.
In the case of case adjustment, it turns out several of the different "words" in the corpus are actually the same, but because they have different capitalizations, they're counted as different unique words.
How many words are in President Fillmore's article?
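As a first pass at that question (the dictionary key 'Millard Fillmore' is an assumption here), len only counts characters, and a naive whitespace split gives just a rough word count:
In [ ]:
# 'Millard Fillmore' is assumed to be the key for Fillmore's biography
fillmore = bios['Millard Fillmore']
print("{0:,} characters".format(len(fillmore)))
print("{0:,} words from a naive whitespace split".format(len(fillmore.split())))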
A biography can be represented as a single large string (as it is now), but this huge string is not very helpful for analyzing features of the text until the string is segmented into "tokens", which include words but also hyphenated phrases or contractions ("aren't", "doesn't", etc.)
There are a variety of different segmentation/tokenization strategies (with different tradeoffs) and corresponding methods implemented in NLTK.
We could employ a naive approach of splitting on spaces. This turns out to create "words" with happenstance punctuation stuck to them.
In [ ]:
example_ws_tokens = example.split(' ')
print("There are {0:,} words when splitting on white spaces.".format(len(example_ws_tokens)))
example_ws_tokens[:25]
We could use regular expressions to split on repeated whitespaces.
In [ ]:
example_re_tokens = re.split(r'\s+',example)
print("There are {0:,} words when splitting on white spaces with regular expressions.".format(len(example_re_tokens)))
example_re_tokens[0:25]
It's clear we want to separate words based on other punctuation as well, so that "Darkness," and "Darkness" aren't treated as separate words. Again, NLTK has a variety of methods for doing word tokenization more intelligently. word_tokenize is probably the easiest to recommend.
In [ ]:
example_wt_tokens = nltk.word_tokenize(example)
print("There are {0:,} words when splitting on white spaces with word_tokenize.".format(len(example_wt_tokens)))
example_wt_tokens[:25]
But there are others, like wordpunct_tokenize, that make different assumptions about the language.
In [ ]:
example_wpt_tokens = nltk.wordpunct_tokenize(example)
print("There are {0:,} words when splitting on white spaces with wordpunct_tokenize.".format(len(example_wpt_tokens)))
example_wpt_tokens[:25]
And ToktokTokenizer is yet another word tokenizer.
In [ ]:
toktok = nltk.ToktokTokenizer()
example_ttt_tokens = toktok.tokenize(example)
print("There are {0:,} words when splitting on white spaces with TokTok.".format(len(example_ttt_tokens)))
example_ttt_tokens[:25]
There are a variety of strategies for splitting a text document up into its constituent words, each making different assumptions about word boundaries, which results in different counts of the resulting tokens.
In [ ]:
for name, tokenlist in zip(['space_split','re_tokenizer','word_tokenizer','wordpunct_tokenizer','toktok_tokenizer'],
                           [example_ws_tokens,example_re_tokens,example_wt_tokens,example_wpt_tokens,example_ttt_tokens]):
    print("{0:>20}: {1:,} total tokens, {2:,} unique tokens".format(name,len(tokenlist),len(set(tokenlist))))
In [ ]:
example_wpt_lowered = [token.lower() for token in example_wpt_tokens]
unique_wpt = len(set(example_wpt_tokens))
unique_lowered_wpt = len(set(example_wpt_lowered))
difference = unique_wpt - unique_lowered_wpt
print("There are {0:,} unique words in example before lowering and {1:,} after lowering,\na difference of {2} unique tokens.".format(unique_wpt,unique_lowered_wpt,difference))
In [ ]:
nltk.FreqDist(example_wpt_lowered).most_common(25)
NLTK helpfully has a list of stopwords in different languages.
In [ ]:
english_stopwords = nltk.corpus.stopwords.words('english')
english_stopwords[:10]
We can also use the string module's punctuation attribute.
In [ ]:
list(string.punctuation)[:10]
Let's combine them to get a list of all_stopwords that we can ignore.
In [ ]:
all_stopwords = english_stopwords + list(string.punctuation) + ['–']
We can use a loop (or list comprehension) to exclude the words in this stopword list from the analysis while also lowercasing each token. This is not perfect, but it's an improvement over what we had before.
In [ ]:
wpt_lowered_no_stopwords = []
for word in example_wpt_tokens:
    if word.lower() not in all_stopwords:
        wpt_lowered_no_stopwords.append(word.lower())

fdist_wpt_lowered_no_stopwords = nltk.FreqDist(wpt_lowered_no_stopwords)
fdist_wpt_lowered_no_stopwords.most_common(25)
The distribution of word frequencies, even after stripping out stopwords, follows a remarkably strong pattern. Most terms are used infrequently (the upper-left of the plot below), but a handful of terms are used repeatedly! Zipf's law states:
"the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc."
In [ ]:
freq_counter = Counter(fdist_wpt_lowered_no_stopwords.values())
f,ax = plt.subplots(1,1)
ax.scatter(x=list(freq_counter.keys()),y=list(freq_counter.values()))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Term frequency')
ax.set_ylabel('Number of terms')
Lemmatization (and the related concept of stemming) are methods for dealing with inflected and conjugated words. Words like "ate" or "eats" are counted as distinct from "eat", although semantically they are similar and should likely be grouped together. Where stemming just removes common suffixes and prefixes, sometimes resulting in mangled words, lemmatization attempts to return the root word (lemma). However, lemmatization can be extremely expensive computationally, which makes it a poor candidate for very large corpora.
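To see the difference concretely, here is a minimal sketch contrasting NLTK's PorterStemmer with the WordNet lemmatizer (the word list is just for illustration, and each word is treated as a verb):
In [ ]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()
for word in ['ate', 'eats', 'tried', 'running', 'studies']:
    # The stemmer strips suffixes mechanically (sometimes mangling the word);
    # the lemmatizer looks the word up in WordNet
    print(word, porter.stem(word), wordnet_lemmatizer.lemmatize(word, pos='v'), sep='\t')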
The get_wordnet_pos and lemmatizer functions below work with each other to lemmatize a word to its root. This involves attempting to discover the part of speech (POS) for each word and passing this POS to NLTK's lemmatize method, ultimately returning the root word (if it exists in the WordNet corpus).
In [ ]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmatizer(token):
    token, tb_pos = nltk.pos_tag([token])[0]
    pos = get_wordnet_pos(tb_pos)
    lemma = wnl.lemmatize(token, pos)
    return lemma
Loop through all the tokens in wpt_lowered_no_stopwords, applying the lemmatizer function to each. Then inspect 25 examples of words where the lemmatizer changed the word length.
In [ ]:
wpt_lemmatized = [lemmatizer(t) for t in wpt_lowered_no_stopwords]
[(i,j) for (i,j) in list(zip(wpt_lowered_no_stopwords,wpt_lemmatized)) if len(i) != len(j)][:25]
In [ ]:
def text_preprocessor(text):
    """Takes a large string (document) and returns a list of cleaned tokens"""
    tokens = nltk.wordpunct_tokenize(text)
    clean_tokens = []
    for t in tokens:
        if t.lower() not in all_stopwords and len(t) > 2:
            clean_tokens.append(lemmatizer(t.lower()))
    return clean_tokens
We can apply this function to every presidential biography (this may take a minute or so) and write the resulting lists of cleaned tokens to the "potus_wiki_bios_cleaned.json" file. We'll use this file in the next lecture as well.
In [ ]:
# Clean each bio
cleaned_bios = {}
for bio_name, bio_text in bios.items():
    cleaned_bios[bio_name] = text_preprocessor(bio_text)

# Save to disk
with open('potus_wiki_bios_cleaned.json','w') as f:
    json.dump(cleaned_bios, f)
How many total words are in each cleaned biography?
In [ ]:
potus_total_words = {}
for bio_name, bio_text in cleaned_bios.items():
    potus_total_words[bio_name] = len(bio_text)
pd.Series(potus_total_words).sort_values(ascending=False)
How many unique words?
In [ ]:
potus_unique_words = {}
for bio_name, bio_text in cleaned_bios.items():
    potus_unique_words[bio_name] = len(set(bio_text))
pd.Series(potus_unique_words).sort_values(ascending=False)
The lexical diversity is the ratio of unique words to total words. Values closer to 0 indicate the presence of repeated words (low diversity) and values closer to 1 indicate words used only once (high diversity).
In [ ]:
def lexical_diversity(token_list):
    unique_tokens = len(set(token_list))
    total_tokens = len(token_list)
    if total_tokens > 0:
        return unique_tokens/total_tokens
    else:
        return 0
In [ ]:
potus_lexical_diversity = {}
for bio_name, bio_text in cleaned_bios.items():
    potus_lexical_diversity[bio_name] = lexical_diversity(bio_text)
pd.Series(potus_lexical_diversity).sort_values(ascending=False)
We can count how often a word occurs in each biography.
In [ ]:
# Import the Counter function
from collections import Counter
# Get counts of each token from the cleaned_bios for Grover Cleveland
cleveland_counts = Counter(cleaned_bios['Grover Cleveland'])
# Convert to a pandas Series and sort
pd.Series(cleveland_counts).sort_values(ascending=False).head(25)
In [ ]:
potus_word_counts = {}
for bio_name, bio_text in cleaned_bios.items():
    potus_word_counts[bio_name] = Counter(bio_text)
potus_word_counts_df = pd.DataFrame(potus_word_counts).T
potus_word_counts_df.to_csv('potus_word_counts.csv',encoding='utf8')
print("There are {0:,} unique words across the {1} presidents.".format(potus_word_counts_df.shape[1],potus_word_counts_df.shape[0]))
Which words occur the most across presidential biographies?
In [ ]:
potus_word_counts_df.sum().sort_values(ascending=False).head(20)
In [ ]:
Step 2: Compute some descriptive statistics about the company articles: which have the most words, the most unique words, and the greatest lexical diversity; which words are used most across articles; and how many unique words appear across all the articles.
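One possible starting point, sketched under the assumption that the company articles have already been cleaned into a dictionary of token lists called cleaned_articles (mirroring cleaned_bios above):
In [ ]:
# A sketch only: cleaned_articles (company name -> list of cleaned tokens) is assumed
article_stats = pd.DataFrame({
    'total_words': {name: len(tokens) for name, tokens in cleaned_articles.items()},
    'unique_words': {name: len(set(tokens)) for name, tokens in cleaned_articles.items()}
})
article_stats['lexical_diversity'] = article_stats['unique_words']/article_stats['total_words']

# Most-used words across all articles, and the size of the combined vocabulary
all_article_counts = Counter()
for tokens in cleaned_articles.values():
    all_article_counts.update(tokens)
print(all_article_counts.most_common(10))
print("{0:,} unique words across all articles".format(len(all_article_counts)))

article_stats.sort_values('total_words', ascending=False).head(10)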
In [ ]:
Functions and operations to scrape the most recent (10 August 2018) Wikipedia content from every member of "Category:Presidents of the United States".
The get_page_content function gets the content of an article as HTML and parses the HTML to return something close to a clean string of text. The get_category_subcategories and get_category_members functions get all the members of a category in Wikipedia.
In [ ]:
def get_page_content(title, lang='en', redirects=1):
    """Takes a page title and returns a dictionary mapping the (redirect-resolved)
    title to a (large) string of the cleaned paragraph text of the article.

    title - a string for the title of the Wikipedia article
    lang - a string (typically two letter ISO 639-1 code) for the language
        edition, defaults to "en"
    redirects - 1 or 0 for whether to follow page redirects, defaults to 1

    Returns:
    dict - {title: text} with a (large) string of the content of the revision
    """
    bad_titles = ['Special:','Wikipedia:','Help:','Template:','Category:','International Standard','Portal:','s:','File:','Digital object identifier','(page does not exist)']

    # Get the response from the API for a query
    params = {'action':'parse',
              'format':'json',
              'page':title,
              'redirects':redirects,
              'prop':'text',
              'disableeditsection':1,
              'disabletoc':1
              }

    url = 'https://{0}.wikipedia.org/w/api.php'.format(lang)
    req = requests.get(url, params=params)
    json_string = json.loads(req.text)

    if 'parse' in json_string.keys():
        new_title = json_string['parse']['title']
        page_html = json_string['parse']['text']['*']

        # Parse the HTML into Beautiful Soup
        soup = BeautifulSoup(page_html, 'lxml')

        # Remove sections at end
        bad_sections = ['See_also','Notes','References','Bibliography','External_links']
        sections = soup.find_all('h2')
        for section in sections:
            if section.span and section.span.get('id') in bad_sections:
                # Clean out the divs
                div_siblings = section.find_next_siblings('div')
                for sibling in div_siblings:
                    sibling.clear()
                # Clean out the ULs
                ul_siblings = section.find_next_siblings('ul')
                for sibling in ul_siblings:
                    sibling.clear()

        # Get all the paragraphs
        paras = soup.find_all('p')
        text_list = []
        for para in paras:
            _s = para.text
            # Remove the citations
            _s = re.sub(r'\[[0-9]+\]', '', _s)
            text_list.append(_s)

        final_text = '\n'.join(text_list).strip()
        return {new_title: final_text}
def get_category_subcategories(category_title, lang='en'):
    """The function accepts a category_title and returns a list of the category's sub-categories

    category_title - a string (including "Category:" prefix) of the category name
    lang - a string (typically two letter ISO 639-1 code) for the language edition,
        defaults to "en"

    Returns:
    members - a list containing strings of the sub-categories in the category
    """
    # Replace spaces with underscores
    category_title = category_title.replace(' ','_')

    # Make sure "Category:" appears in the title
    if 'Category:' not in category_title:
        category_title = 'Category:' + category_title

    _S = "https://{1}.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle={0}&cmtype=subcat&cmprop=title&cmlimit=500&format=json&formatversion=2".format(category_title, lang)
    json_response = requests.get(_S).json()

    members = list()
    if 'categorymembers' in json_response['query']:
        for member in json_response['query']['categorymembers']:
            members.append(member['title'])
    return members
def get_category_members(category_title, depth=1, lang='en'):
    """The function accepts a category_title and returns a list of category members

    category_title - a string (including "Category:" prefix) of the category name
    depth - how many levels of sub-categories to recurse into, defaults to 1
    lang - a string (typically two letter ISO 639-1 code) for the language edition,
        defaults to "en"

    Returns:
    members - a list containing strings of the page titles in the category
    """
    # Replace spaces with underscores
    category_title = category_title.replace(' ','_')

    # Make sure "Category:" appears in the title
    if 'Category:' not in category_title:
        category_title = 'Category:' + category_title

    _S = "https://{1}.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle={0}&cmprop=title&cmnamespace=0&cmlimit=500&format=json&formatversion=2".format(category_title, lang)
    json_response = requests.get(_S).json()

    members = list()
    if depth < 0:
        return members

    if 'categorymembers' in json_response['query']:
        for member in json_response['query']['categorymembers']:
            members.append(member['title'])

    subcats = get_category_subcategories(category_title, lang=lang)
    for subcat in subcats:
        members += get_category_members(subcat, depth-1, lang=lang)

    return members
Use get_category_members to get all the immediate members (depth=0) of "Category:Presidents of the United States".
In [ ]:
presidents = get_category_members('Presidents_of_the_United_States',depth=0)
presidents
Loop through the presidents list from the fourth entry onward and get each president's biography using get_page_content. Store the results in the presidents_wiki_bios dictionary.
In [ ]:
presidents_wiki_bios = {}
for potus in presidents[3:]:
    presidents_wiki_bios.update(get_page_content(potus))
Save the data to a JSON file.
In [ ]:
with open('potus_wiki_bios.json','w') as f:
    json.dump(presidents_wiki_bios, f)
Wikipedia maintains a (superficially) up-to-date List of S&P 500 companies, but not a category of the constituent members. Like the presidents, we want to retrieve a list of all their Wikipedia articles, parse their content, and perform some NLP tasks.
First, get the content of the article so we can parse out the list.
In [ ]:
title = 'List of S&P 500 companies'
lang = 'en'
redirects = 1
params = {'action':'parse',
'format':'json',
'page':title,
'redirects':1,
'prop':'text',
'disableeditsection':1,
'disabletoc':1
}
url = 'https://en.wikipedia.org/w/api.php'
req = requests.get(url,params=params)
json_string = json.loads(req.text)
if 'parse' in json_string.keys():
    page_html = json_string['parse']['text']['*']
    # Parse the HTML into Beautiful Soup
    soup = BeautifulSoup(page_html, 'lxml')
The hard way to get the company names out is to parse the HTML table in soup ourselves: find the first table, loop over its rows, pull the article title from the link in each row's second cell, and collect the titles in company_names.
In [ ]:
company_names = []

# Get the first table
component_stock_table = soup.find_all('table')[0]

# Get all the rows after the first (header) row
rows = component_stock_table.find_all('tr')[1:]

# Loop through each row and extract the title
for row in rows:
    # Get all the links in a row
    links = row.find_all('a')
    # Get the title in the 2nd cell from the left
    title = links[1]['title']
    # Add it to company_names
    company_names.append(title)
print("There are {0:,} titles in the list".format(len(set(company_names))))
The easy way is to use pandas's read_html function to parse the table into a DataFrame and access the "Security" (second) column.
In [ ]:
company_df = pd.read_html(str(component_stock_table),header=0)[0]
company_df.head()
In [ ]:
company_names = company_df['Security'].tolist()
Now we can use get_page_content to get the content of each company's page and add it to the sp500_articles dictionary.
In [ ]:
sp500_articles = {}
for company in set(company_names):
    sp500_articles.update(get_page_content(company))
Save the data to a JSON file.
In [ ]:
with open('sp500_wiki_articles.json','w') as f:
    json.dump(sp500_articles, f)